[SPARK-4131][SQL] support writing data into the filesystem from queries #4380
scwf wants to merge 17 commits into apache:master from
Conversation
Test build #26814 has finished for PR 4380 at commit

seems this is not necessary

Test build #26918 has finished for PR 4380 at commit

Test build #26919 has finished for PR 4380 at commit

/cc @liancheng can you help review this?

Test build #26982 has finished for PR 4380 at commit

Test build #26984 has finished for PR 4380 at commit

Test build #26983 has finished for PR 4380 at commit
retest this please
hi, @scwf
FYI, the DescribeCommand has been moved into sources.ddl.scala; can we remove it from here?

yeah, my bad, I should remove it from here
Test build #26990 has finished for PR 4380 at commit

Test build #26993 has finished for PR 4380 at commit
Actually @yhuai's work doesn't cover your concerns entirely, but you can build yours upon his. Basically it's a
Yeah, got it, but that may be much later. Is it possible to let this in as a transition, since
ping. any update here?
@marmbrus, according to liancheng's suggestion this PR relies on the Hive SerDe data source (not implemented yet), so this version adds 'WriteToDirs' to implement the feature. Could you have a look and give some feedback?
+1
I think my biggest feedback is that this seems like a lot of new code, given we already have the ability to write data using arbitrary SerDes to a file system. Can we accomplish the same thing with only small changes to the parser and maybe a small logical / physical node? Or am I missing something?

Yes, in InsertIntoHiveTable we can write data using arbitrary SerDes to a file system. But we cannot reuse InsertIntoHiveTable to share the same physical plan, since:
1 InsertIntoHiveTable and WriteToDirectory have different inputs
2 InsertIntoHiveTable has complex logic to handle dynamic partitioning
So I think we can extract a common interface (SaveAsHiveFile) to reuse the code for writing data to the file system.
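To make the proposed factoring concrete, here is a minimal, self-contained sketch. The types are simplified stand-ins rather than the real Spark/Hive classes; everything except the names SaveAsHiveFile, WriteToDirectory, and InsertIntoHiveTable is an assumption for illustration only.

    // Simplified stand-in for Hive's TableDesc (assumption, not the real class).
    case class TableDesc(serde: String, inputFormat: String, outputFormat: String)

    // Shared file-writing logic, extracted so both operators can reuse it.
    trait SaveAsHiveFile {
      def saveAsHiveFile(rows: Iterator[String], desc: TableDesc, path: String): Unit = {
        // The real implementation would configure the SerDe/OutputFormat from `desc`
        // and write through Hadoop writers; a plain file write stands in here.
        val writer = new java.io.PrintWriter(new java.io.File(path))
        try rows.foreach(r => writer.println(r)) finally writer.close()
      }
    }

    // Writes query output straight to a directory; no metastore table involved.
    case class WriteToDirectory(path: String, isLocal: Boolean, desc: TableDesc)
        extends SaveAsHiveFile {
      def run(rows: Iterator[String]): Unit = saveAsHiveFile(rows, desc, path)
    }

    // The existing insert-into-table operator would mix in the same trait.
    case class InsertIntoHiveTable(tableName: String, location: String, desc: TableDesc)
        extends SaveAsHiveFile {
      def run(rows: Iterator[String]): Unit = saveAsHiveFile(rows, desc, location)
    }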
Test build #30628 has started for PR 4380 at commit

Test build #30633 has started for PR 4380 at commit

/cc @marmbrus

Test build #31424 has started for PR 4380 at commit
Updated. To summarize:
1 Parse the INSERT OVERWRITE [LOCAL] DIRECTORY statement into WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, ASTNode) in the HiveQL parser
2 Analyze WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, ASTNode) into WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) in the HiveContext analyzer
3 Transform WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) into execution.WriteToDirectory during query planning
4 Extract a common interface SaveAsHiveFile to share the code for writing data to the FS
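As a rough illustration of how steps 2 and 3 fit together, here is a toy sketch. All the types below are made-up stand-ins, not Spark's actual analyzer or planner APIs.

    // Toy stand-ins (assumptions for illustration; not Spark's real plan classes).
    case class TableDesc(serde: String, outputFormat: String)

    sealed trait LogicalPlan
    case class ParsedWriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, ast: AnyRef) extends LogicalPlan // what the parser emits (step 1)
    case class WriteToDirectory(path: String, child: LogicalPlan, isLocal: Boolean, desc: TableDesc) extends LogicalPlan  // what the analyzer produces (step 2)
    case class Relation(table: String) extends LogicalPlan

    sealed trait SparkPlan
    case class WriteToDirectoryExec(path: String, child: SparkPlan, isLocal: Boolean, desc: TableDesc) extends SparkPlan
    case class ScanExec(table: String) extends SparkPlan

    // Step 2: resolve the raw AST into a TableDesc so later phases never touch the AST.
    def analyze(plan: LogicalPlan, descOf: AnyRef => TableDesc): LogicalPlan = plan match {
      case ParsedWriteToDirectory(path, child, isLocal, ast) =>
        WriteToDirectory(path, analyze(child, descOf), isLocal, descOf(ast))
      case other => other
    }

    // Step 3: query planning maps the logical node onto an executable physical node.
    def toPhysical(plan: LogicalPlan): SparkPlan = plan match {
      case WriteToDirectory(path, child, isLocal, desc) =>
        WriteToDirectoryExec(path, toPhysical(child), isLocal, desc)
      case Relation(table) => ScanExec(table)
      case other => sys.error(s"unexpected node during planning: $other")
    }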
ping
Test build #33078 has started for PR 4380 at commit

We shouldn't be holding onto the AST anymore, as this ties us a little too closely to a specific version of Hive. Instead, look at how we are doing things in CreateTableAsSelect.
Also, please add a test case where you use this new operation on a Spark SQL temporary table.

yeah, thanks, I am updating this

Test build #33232 has started for PR 4380 at commit
Test build #38789 has finished for PR 4380 at commit

retest this please

Test build #38811 has finished for PR 4380 at commit

Test build #144 has finished for PR 4380 at commit

Test build #39628 has finished for PR 4380 at commit

Test build #40233 has finished for PR 4380 at commit
@scwf looks like there is a real failure?
@rxin yes, since we upgraded the Hive version to 1.2.1, we need to adapt the token tree handling in HiveQl; the old one is not correct for 1.2.1. Updated.
retest this please

Test build #40600 has finished for PR 4380 at commit
@scwf @liancheng Is there any plan to merge this PR into branch-1.5 soon? I think this feature is pretty useful.

@litao-buptsse Since it is a new feature, I think we will not get it into branch-1.5. But we can target 1.6.

@yhuai Got it, thank you very much.
@scwf @yhuai I applied this patch to my branch-1.5 code. It works! But I found a bug: when I use lower case "local", it tries to insert into the HDFS file system; when I use upper case "LOCAL", it inserts into the local file system correctly.
@scwf @yhuai For the code above, should "LOCAL" be matched case-insensitively?
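A tiny illustration of the case-sensitivity issue. This is not the actual parser code from the patch, just the general shape of the reported bug and the likely fix.

    // Exact comparison only recognizes the upper-case keyword, so "local" falls
    // through to the HDFS path, which matches the reported behavior.
    def isLocalKeyword(token: String): Boolean = token == "LOCAL"

    // Comparing case-insensitively accepts "LOCAL", "local", "Local", etc.
    def isLocalKeywordFixed(token: String): Boolean = token.equalsIgnoreCase("LOCAL")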
@scwf @yhuai It works in "local" mode, but not well in "yarn-client" mode.

@litao-buptsse, I will update this soon, thanks.

@yhuai @liancheng is this still relevant?

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
I think it's a useful feature that is widely used in Hive. Why not finish it and merge it into branch-1.6?
Support writing data into the filesystem from queries
syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory [ROW FORMAT row_format] STORED AS file_format SELECT ... FROM ...
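For example, once merged the feature is intended to support statements like the following from a HiveContext. The output path and the source table `src` are made up for illustration; the Scala setup shown is the standard Spark 1.x HiveContext API.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("insert-overwrite-directory-example"))
    val hiveContext = new HiveContext(sc)

    // Write the query result to a local directory as comma-delimited text files.
    hiveContext.sql("""
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/query_output'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      SELECT key, value FROM src
    """)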